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Abstract — A novel regularizer of the PARAFAC decomposition 
factors capturing the tensor's rank is proposed in this paper, 
as the key enabler for completion of three-way data arrays 
with missing entries. Set in a Bayesian framework, the tensor 
completion method incorporates prior information to enhance its 
smoothing and prediction capabilities. This probabilistic approach 
can naturally accommodate general models for the data distribu- 
tion, lending itself to various fitting criteria that yield optimum 
estimates in the maximum-a-posteriori sense. In particular, two 
algorithms are devised for Gaussian- and Poisson-distributed 
data, that minimize the rank-regularized least-squares error and 
KuUback-Leibler divergence, respectively. The proposed tech- 
nique is able to recover the "ground-truth" tensor rank when 
tested on synthetic data, and to complete brain imaging and yeast 
gene expression datasets with 50% and 15% of missing entries 
respectively, resulting in recovery errors at — lOdB and — 15dB. 

Index Terms — Tensor, low-rank, missing data, Bayesian infer- 
ence, Poisson process. 



I. Introduction 

Imputation of missing data is a basic task arising in various 
Big Data applications as diverse as medical imaging lfT2l . 
bioinformatics |3||, as well as social and computer networking 
ifTOl . ifTTl . The key idea rendering recovery feasible is the 
"regularity" present among missing and available data. Low 
rank is an attribute capturing this regularity, and can be 
readily exploited when data are organized in a matrix. A 
natural approach to low-rank matrix completion problem is 
minimizing the rank of a target matrix, subject to a constraint 
on the error in fitting the observed entries Since rank 
minimization is generally NP-hard ||261 . the nuclear norm has 
been advocated recently as a convex surrogate to the rank lIlTI . 
Beyond tractability, nuclear-norm minimization enjoys good 
performance both in theory as well as in practice |)4]. 

The goal of this paper is imputation of missing entries of 
tensors (also known as multi-way arrays), which are high-order 
generalizations of matrices frequently encountered in chemo- 
metrics, medical imaging, and networking ||8l, |fT6l . Leveraging 
the low-rank structure for tensor completion is challenging, 
since even computing the tensor rank is NP-hard llT4l . Defining 
a nuclear norm surrogate is not obvious either, since singular 
values as defined by the Tucker decomposition are not gen- 
erally related with the rank. Traditional approaches to finding 
low-dimensional representations of tensors include unfolding 
the multi-way data and applying matrix factorizations such 
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as the singular-value decomposition (SVD) JS], Q, ll25l or, 
employing the parallel factor (PARAFAC) decomposition ||9l , 
Il24l . In the context of tensor completion, an approach falling 
under the first category can be found in llT2l . while imputation 
using PARAFAC was dealt with in 

The imputation approach presented in this paper builds on 
a novel regularizer accounting for the tensor rank, that relies 
on redefining the matrix nuclear norm in terms of its low-rank 
factors. The contribution is two-fold. First, it is established that 
the low-rank inducing property of the regularizer carries over 
to tensors by promoting sparsity in the factors of the tensor's 
PARAFAC decomposition. In passing, this analysis allows for 
drawing a neat connection with the atomic-norm in |5J. The 
second contribution is the incorporation of prior information, 
with a Bayesian approach that endows tensor completion with 
extra smoothing and prediction capabilities. A parallel analysis 
in the context of reproducing kernel Hilbert spaces (RKHS) 
further explains these acquired capabilities, provides an alter- 
native means of obtaining the prior information, and establishes 
a useful connection with collaborative filtering approaches |[T] 
when reduced to the matrix case. 

While least-squares (LS) is typically utilized as the fitting 
criterion for matrix and tensor completion, implicitly assuming 
Gaussian data, the adopted probabilistic framework supports 
the incorporation of alternative data models. Targeting count 
processes available in the form of network traffic data, genome 
sequencing, and social media interactions, which are modeled 
as Poisson distributed, the maximum a posteriori (MAP) es- 
timator is expressed in terms of the KuUback-Leibler (K-L) 
divergence ITOl . 

The remainder of the paper is organized as follows. Section 
nil offers the necessary background on nuclear-norm regular- 
ization for matrices, the PARAFAC decomposition, and the 
definition of tensor rank. Section [III] presents the tensor com- 
pletion problem, establishing the low-rank inducing property of 
the proposed regularization. Prior information is incorporated 
in Section IIVI with Bayesian and RKHS formulations of the 
tensor imputation method, leading to the low-rank tensor- 
imputation (LRTI) algorithm. Section |V] develops the method 
for Poisson tensor data, and redesigns the algorithm to min- 
imize the rank-regularized K-L divergence. Finally, Section 
rvTl presents numerical tests carried out on synthetic and real 
data, including expression levels in yeast, and brain magnetic 
resonance images (MRI). Conclusions are drawn in Section 
VII, while most technical details are deferred to the Appendix. 

The notation adopted throughout includes bold lowercase 
and capital letters for vectors a and matrices A, respectively, 
with superscript T denoting transposition. Tensors are under- 
lined as e.g., X, and their slices carry a subscript as in Xp; 
see also Fig. [T] Both the matrix and tensor Frobenius norms 
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are represented by || • \\p. Symbols (g), 0, ®, and o, denote 
the Kroneker, Kathri-Rao, Hadamard (entry-wise), and outer 
product, respectively. 

II. Preliminaries 

A. Nuclear-norm minimization for matrix completion 

Low-rank approximation is a popular method for estimating 
missing values of a matrix Z € W^'^^^ , which capitalizes on 
"regularities" across the data ifTTI . For the imputation to be 
feasible, a binding assumption that relates the available entries 
with the missing ones is required. An alternative is to postulate 
that Z has low rank R ^ min(iV, M). The problem of finding 
matrix Z with rank not exceeding R, which approximates Z in 
the given entries specified by a binary matrix A G {0,1}^^*^, 
can be formulated as 

Z = argimii||(Z - X)®A||^ s. to rank(X) < i? . (1) 

The low -rank property of matrix X implies that the vector s(X) 
of its singular values is sparse. Hence, the rank constraint is 
equivalent to ||s(X)||o < R, where the £o-(pseudo)norm j| • j|o 
equals the number of nonzero entries of its vector argument. 

Aiming at a convex relaxation of the NP-hard problem ([l), 
one can leverage recent advances in compressive sampling llTTl 
and surrogate the £o-norm with the £i-norm, which here equals 
the nuclear norm of X defined as ||X||h. := ||s(X)||i. With this 
relaxation, the Lagrangian counterpart of dl] is 

Z = argniin i|| (Z - X)®A||| + ^||X||, (2) 

where > is a rank-controlling parameter Problem (|2) can 
be further transformed by considering the following character- 
ization of the nuclear norm ||231 

m* = ,-^^^^^,\m\i + mi) s. tox = Bc^. o) 

For an arbitrary matrix X with SVD X = USV^, the mini- 
mum in © is attained for B = US^/^ and C = VSi/^. The 
optimization in (|3]l is over all possible bilinear factorizations 
of X, so that the number of columns of B and C is also 
a variable. Building on (|3]l, one can arrive at the following 
equivalent reformulation of ^ |]T7] 

Z' = arg min - X)®A|||, + g(||B|||, + 1|C||^) 

s. to X = BC^. (4) 

The equivalence implies that by finding the global minimum 
of (|4|, one can recover the optimal solution of (|2]i. However, 
since (|4| is nonconvex, it may have multiple stationary points. 
Interestingly, the next result provides conditions for these 
stationary points to be globally optimal (parts a) and b) are 
proved in the Appendix, while the proof for c) can be found 
in lini.) 

Proposition 1: Problems ^ and (|4| are equivalent, in the 
sense that: 

a) global minima coincide: Z = Z'; 

b) all local minima of (|4| are globally optimal; and, 

c) stationary points X o/ (|4| satisfying \\ (X— Z)® A|j2 < p. 
are globally optimal. 



X 




Fig. 1. Tensor slices along the row, column, and tube dimensions. 

This result plays a critical role in this paper, as the 
Frobenius-norm regularization for controlling the rank in (|4|l, 
will be useful to obtain its tensor counterparts in Section |III1 



B. PARAFAC decomposition 

The PARAFAC decomposition of a tensor X £ M^^''^''^ is 
at the heart of the proposed imputation method, since it offers 
a means to define its rank ||9|, lf24l . Given i? S N, consider 
matrices A G M^^-^, B e and C e R-P^^, such that 

R 

X(m, n,p) = ^ A(m, r)B(n, r)C(p, r). (5) 

r=l 

The rank of X is the minimum value of R for which this 
decomposition is possible. For R* := rank(X), the PARAFAC 
decomposition is given by the corresponding factor matrices 
{A, B, C} (all with R* columns), so that (H) holds with R = 

R*. 

To appreciate why the aforementioned rank definition is 
natural, rewrite (|5) as X = arob,.oc,., where a^, b^, and 

Cr represent the r-th columns of A, B, and C, respectively; 
and the outer products := a,, o o £ jjA/xWxp j^^yg 
entries (to, := A(m, r)B(n, r')C(p, r). The rank of a 
tensor is thus the minimum number of outer products (rank one 
factors) required to represent the tensor It is not uncommon 
to adopt an equivalent normaUzed representation 

R R 
X = a^ O br O Cr = 7r(Ur O O W^) (6) 

r— 1 r— 1 

by defining unit- norm vectors u,. := a,./j|ar j|, v,. b,./||b,.||, 
Wr := Cr/||c,.||, and weights 7^ := |jar|j ||br|| ||c,.||, r = 

Let Xp , p = 1 , . . . , P denote the p-th slice of X along its 
third (tube) dimension, such that Xp(TO,n) ^{m,n,p); 
see Fig. [T] The following compact form of the PARAFAC 
decomposition in terms of slice factorizations will be used in 
the sequel 

Xp = Adiag [eJC] B, p=l,...,P (7) 

where the diagonal matrix diag [u] has the vector u on its diag- 
onal, and ej is the p-th row of the P x P identity matrix. The 
PARAFAC decomposition is symmetric [cf. ©I, and one can 
also write Xm = Bdiag [e^jA] C, or, X„ = Cdiag [e^B] A 
in terms of slices along the first (row), or, second (column) 
dimensions. 
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III. Rank regularization for tensors 

Generalizing the nuclear-norm regularization technique dU 
from low-rank matrix to tensor completion is not straightfor- 
ward, since singular values of a tensor (given by the Tucker 
decomposition) are not related to the rank llT6l . Fortunately, 
the Frobenius-norm regularization outlined in Section III-AI 
offers a viable option for low-rank tensor completion under 
the PARAFAC model, by solving 



ml- 



IB 



s. to 



Xp = Adiag 



[eJC]B, p=h 



.P 



\C\\%) 
(8) 



where the Frobenius norm of a tensor is defined as ||X|If 

Sp^^("^' '^'-P)' Hadamard product as 

(X®A)(m, Xfm, n, p) A(m, n, p). 

Different from the matrix case, it is unclear whether the 
regularization in (|8) bears any relation with the tensor rank. 
Interestingly, the following analysis corroborates the capability 
of ^ to produce a low-rank tensor Z, for sufficiently large /i. 
In this direction, consider an alternative completion problem 
stated in terms of the normalized tensor representation (|6) 



1, 



Z := arg min -\\ (Z - X) ®A|||, + 



2/3 
2/3 



B. 



S. to X = 7r(Ur O Vr O Wr) (9) 

r=l 

the nonconvex £2/3 (pseudo)-norm 



where 7 := [71, ... ,77? 

is given by II7II2/3 ■— (X^^Li l7rP^'^)^^^; and the unit-norm 
constraint on the factors' columns is left implicit. Problems 
^ and (|9]l are equivalent as established by the following 
proposition (its proof is provided in the Appendix.) 

Proposition 2: The solutions of dHJ and (|9]l coincide, i.e., Z = 
Z, with optimal factors related by — \/j^Ur; — v^TVVr, 
and Cr ~ v^TVWr, r = 1, . . . , _R. 

To further stress the capability of (|8) to produce a low-rank 
approximant tensor X, consider transforming (|9) once more 
by rewriting it in the constrained-error form 



Z := are' min II7I 

- {X,7,{u.},{v,},{w,}}" 



2/3 



(10) 



s. to II (Z-X) ®A||^ < cr^ 



It 

X = 7r(Ur O Vr O W^). 



For any value of there exists a corresponding Lagrange 
multiplier A such that (|9]l and ( fTOl ) yield the same solution, 
under the identity /i = 2/ A. [Since f{x) = x^l'^ is an 
increasing function, the exponent of ||7|| 2/3 can be safely elim- 
inated without affecting the minimizer of (fTOll.1 The ^2/3-norm 
II7II2/3 in ([Tol l produces a sparse vector 7 when minimized ||6l, 
sharing this well-documented property of the ^i-norm as their 
norm-one balls, depicted in Fig. |2] share the "pointy geometry" 
which is responsible for inducing sparsity. 

With dHJ equivalently rewritten as in (fTOt . its low-rank 
inducing property is now revealed. As 7 in dTol i becomes 
sparse, some of its entries 7^ are zeroed, and the corresponding 
outer-products 7r(ar o o c^) drop from the sum in (|6), thus 
lowering the rank of X- 




Fig. 2. The l^jy^ovca ball compared to the Iq- and ^i-noiTn balls 

The next property is a direct consequence of the low-rank 
promoting property of (|8) as established in Proposition |2] 

Corollary 1: If %_ denotes the solution to problem (|8]i , and 
M > A^max := II A®Z||^^^ then %_^^my.n^p- 

Corollary [T] asserts that if the penalty parameter is cho- 
sen large enough, the rank is reduced to the extreme case 
rank(Z) = 0. To see why this is a non-trivial property, it is 
prudent to think of ridge-regression estimates where similar 
quadratic regularizers are adopted, but an analogous property 
does not hold. In ridge regression one needs to let /i — s- 00 in 
order to obtain an all-zero solution. Characterization of /imax 
is also of practical relevance as it provides a frame of reference 
for tuning the regularization parameter 

Using ( [Tol l, it is also possible to relate (O with the atomic 
norm in Q. Indeed, the infimum ^i-norm of 7 is a proper norm 
for X, named atomic norm, and denoted by ||Xl|.A := II7II1 0. 
Thus, by replacing II7II2/3 with ||X||.4i (fTol ) becomes convex 
in X- Still, the complexity of solving such a variant of (fTOl i 
resides in that ||X||^ is generally intractable to compute 15]. 
In this regard, it is remarkable that arriving to (ITOl had the 
sole purpose of demonstrating the low-rank inducing property, 
and that ([8]l is to be solved by the algorithm developed in 
the ensuing section. Such an algorithm will neither require 
computing the atomic norm or PARAFAC decomposition of 
X, nor knowing its rank. The number of columns in A, B, 
and C can be set to an overestimate of the rank of Z, such as 
the upper bound R := min{A/iV, 7VF, FA/} > rank(Z), and 
the low-rank of X will be induced by regularization as argued 
earlier To carry out a fair comparison, only convergence to a 
stationary point of ^ will be guaranteed in this paper 
Remark 1: These insights foster future research directions for 
the design of a convex regularizer of the tensor rank. Specif- 



ically, substituting (0(A, B,C) -.^ X]r=i(ll^'" 



|b,,| 



Crll^) for the regularization term in (|8}, turns II7II2/3 into 
II7II1 = Il2£l|yt in the equivalent ( fTOl i. It is envisioned that 
with such a modification in place, the acquired convexity of 
( [Tol l would enable a reformulation of Proposition [T] providing 
conditions for global optimality of the stationary points of ([8]). 

Still, a limitation of ([8]) is that it does not allow for incor- 
porating side information that could be available in addition to 
the given entries A®Z. 

Remark 2: In the context of recommender systems, a de- 
scription of the users and/or products through attributes (e.g.. 
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gender, age) or measures of similarity, is typically available. 
It is thus meaningful to exploit both known preferences and 
descriptions to model the preferences of users JT]. In three-way 
(samples, genes, conditions) microarray data analysis, the rel- 
ative position of single-nucleotide polymorphisms in the DNA 
molecule implies degrees of correlation among genotypes ll22ll . 
These correlations could be available either through a pre- 
scribed model, or, through estimates obtained using a reference 
tensor Z. A probabilistic approach to tensor completion capable 
of incorporating such types of extra information is the subject 
of the ensuing section. 

IV. Bayesian low-rank tensor approximation 
A. Bayesian PARAFAC model 

A probabilistic approach is developed in this section in order 
to integrate the available statistical information into the tensor 
imputation setup. To this end, suppose that the observation 
noise is zero-mean, white, Gaussian; that is 

Zmnp = Xmnp + emnp] SUCh that Cmnp 7V(0, CT^ ) , i.i.d.. 

(11) 

Since vectors in Q are interchangeable, identical dis- 
tributions are assigned across r = 1,...,R, and they are 
modeled as independent from each other, zero-mean Gaussian 
distributed with covariance matrix R^i G R^^^'*^. Similarly, 
vectors and are uncorrected and zero-mean, Gaussian, 
with covariance matrix and Rc, respectively. In addition 
ar, hr, and are assumed mutually uncorrected. And since 
scale ambiguity is inherently present in the PARAFAC model, 
vectors a^, b^, and c,. are set to have equal power; that is, 

:= Tt{Ra) = Tr(RB) = Tr(Rc). (12) 

Under these assumptions, the negative of the posterior dis- 
tribution can be readily written as cxp(— L(X)), with 

i(X) = x^ll(Z-X)®A|l| 

1 ^ 

+ - ^ (a^R^^a^ + h'^Rg^hr + c^R^^c^) 

r=l 

= ^ll(Z-X)®A||2, + i [Tr(A^R:,iA) 
+Tr (B^R^j^B) + Tr (C^Rp^C)] . 

Correspondingly, the MAP estimator of X is 

Z := arg min -ijIKZ - X)®A||| + ^ [Tr (A^R^^^A) 

+Tr (B^R^^B) + Tr (C^Rp^C)] 
s. to Xp = Adiag [eJC] B^, p=l,...,P (13) 

reducing to (|8]i when R^i — Im, Rb — Ijv, and Rc = Ip. 
This Bayesian approach interprets the regularization parameter 
^ [cf. ^] as the noise variance, which is useful in practice 
to select /i. The ensuing section explores the advantages of 
incorporating prior information to the imputation method. 



B. Nonparametric tensor decomposition 

Incorporating the information conveyed by R^, R^, and 
Rc, together with a practical means of finding these matrices 
can be facilitated by interpreting (fTSl l in the context of RKHS 
1271 . In particular, the analysis presented next will use the 
Representer Theorem, interpreted as an instrument for finding 
the best interpolating function in a Hilbert space spanned 
by kernels, just as interpolation with sinc-kemels is carried 
out in the space of bandlimited functions for the purpose of 
reconstructing a signal from its samples llT9l . 

In this context, it is instructive to look at a tensor f : Ai x 
AfxV ^ Ras a function of three variables m, n, and p, living 
in measurable spaces Ai , J\f, and V, respectively. Generalizing 
^ to this nonparametric framework, low-rank functions / are 
formally defined to belong to the following family 

R 

: A^xTVAxP^R : f{m,n,p) = a,. (to)5,. {n)cr [p) 

r=l 

such that a,.(m) e Hm^ br{n) £ Hj^f, Cr{p) G V.-p} 

where Hm, 'H-j\f, and Hp are Hilbert spaces constructed from 
specified kernels fcyn, kj^/ and k-p, defined over Ai, Af, and 
V, while R is an initial overestimate of the rank of /. 

The following nonparametric fitting criterion is adopted for 
finding the best //? interpolating data {z,„„p : Smnp = 1} 

M N P 

Jr :== arg min ^ 5mnp{zmnp - f{m,n,p) f 

rn—1 n—1 p—1 
R 

+ f E(II«'-|I«-M+1I^-II«.V + 1|C.|I«J ■ (14) 

r=l 

It is shown in the Appendix that leveraging the Representer 
Theorem, the minimizer of (fl4t admits a finite dimensional 
representation in terms of kM, kj\/ and k-p, 

fR{m,n,p) = fc^(m)Kxl Adiag [k'^{p)K:p'C] B^Kj/k^{n) 

(15) 

where vector kjVi(TO) and matrix K^vi have entries 
k_M{m,m'), m,m' = 1,...,A/; and where 'kj^f{n), K^v", 
kp(p), and are correspondingly defined in terms of kj\f 
and kp. It is also shown in the Appendix that the coefficient 
matrices A, B, and C can be found by solving 

p 

II (Zp - Adiag [eJC] B^) ®Ap||^ 

' ' p=i 

+ 1 (Tr(A^KxiA) + Tr(B^K^iB) + Tr(C^K^iC)) 

s. to A e R'^^^^, B e R^^^, C G R^^^. (16) 

Problem ( fTSI l reduces to (|8) when the side information is 
discarded by selecting kj^, k^/ and kp as Kronecker deltas, 
in which case K^vi, K^a, and are identity matrices. In 
the general case, ( fTSI l yields the sought nonlinear low-rank 
approximation method for f{m,n,p) when combined with 
dTsl l. evidencing the equivalence between (fT4l) and ( fTSI ). 

Interpreting (fl4l as an interpolator renders (ITjI a natural 
choice for tensor imputation, where in general, missing entries 
are to be inserted by connecting them to surrounding points on 
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the three-dimensional arrangement. Relative to (|8), this RKHS 
perspective also highlights (fT3]l's extra smoothing and extrap- 
olation capabilities. Indeed, by capitalizing on the similarities 
captured by K.m, and Kp, (fTSI l can recover completely 
missing slices. This feature is not shared by imputation meth- 
ods that leverage low-rank only, since these require at least one 
point in the slice to build on colinearities. Extrapolation is also 
possible in this sense. If for instance K^vi can be expanded to 
capture a further point M + 1 not in the original set, then 
a new slice of data can be predicted by (flST l based on its 
correlation fcjVi(A/ + 1) with the available entries. These extra 
capabilities will be exploited in Section [VTl where correlations 
are leveraged for the imputation of MRI data. The method 
described by (fT3] l and (fTSI l can be applied to matrix completion 
by just setting entries of C to one, and can be extended to 
higher-order dimensions with a straightforward alteration of 
the algorithms and theorems throughout this paper 

Identification of covariance matrices R^, Rs, and Rc with 
kernel matrices Kx, Kaa and is the remaining aspect to 
clarify in the connection between il3[ and (fTSI l. It is apparent 
from (fTsT l and (fT6] l that correlations between columns of the 
factors are reflected in similarities between the tensor slices, 
giving rise to the opportunity of obtaining one from the other. 
This aspect is explored next. 

C. Covariance estimation 

To implement dlJl l, matrices Ha, Rb, and Rc must be 
postulated a priori, or alternatively replaced by their sample 
estimates. Such estimates need a training set of vectors a, b, 
and c abiding to the Bayesian model just described, and this 
requires PARAFAC decomposition of training data. In order 
to abridge this procedure, it is convenient to inspect how TLa, 
Rb, and Rc are related to their kernel counterparts. 

Based on the equivalence between the standard RKHS 
interpolator and the linear mean-square error estimator lETI . 
it is useful to re-visit the probabilistic framework and identify 
kernel similarities between slices of X with their mutual 
covariances. Focusing on the tube dimension of X, one can 
write Kpdj'jp) := E(Tr(Xj,Xp)), that is, the covariance 
between slices Xp/ and Xp taking (X, Y) := Tr(X-^Y) as 
the standard inner product in the matrix space. Under this 
alternative definition for Kp, and corresponding definitions for 
Ktv", and KjV4, it is shown in the Appendix that 

Km=O^Ra, K^^O^-Rb, Kv^e'^Rc (17) 
and that 9 is related to the second-order moment of X by 

E\\X\\j, = Re^. (18) 

Since sample estimates for K^vi, Ka/', Kp, and E||X||f 
can be readily obtained from the tensor data, (ITtI and (IT¥I 
provide an agile means of estimating Ra, Rb, and Rc without 
requiring PARAFAC decompositions over the set of training 
tensors. 

This strategy remains valid when kernels are not estimated 
from data. One such case emerges in collaborative filtering 
of user preferences U], where the similarity of two users is 
modeled as a function of attributes; such age or income. 



D. Block successive upper-bound minimization algorithm 

An iterative algorithm is developed here for solving (ITji . by 
cyclically minimizing the cost over A, B, and C. In the first 
step of the cycle the cost in ( fT3] l is minimized with respect to 
(w.rt.) A considering B and C as parameters. Accordingly, 
the partial cost to minimize reduces to 

/(A) i|| (Z-X)®Ai|| + |Tr(A^R;,iA) (19) 

where /i was identified with and substituted for cr^. Function 
( fT9] l is quadratic in A and can be readily minimized after re- 
writing it in terms of a := vec(A) [see Wt\ in the Appendix]. 
However, such an approach becomes computationally infea- 
sible for other than small datasets, since it involves storing 
P matrices of dimensions NM x MR, and solving a linear 
system of MRxMR equations. The alternative pursued here to 
overcome this obstacle relies on the so-called block successive 
upper-bound minimization (BSUM) algorithm |[20|. 

In BSUM one minimizes a judiciously chosen upper-bound 
(7(A, A) of /(A), which: i) depends on the current iterate 
A; ii) should be simpler to optimize; and iii) satisfies certain 
local-tightness conditions; see also 1201 and properties i)-iii) 
below. 

For A given, consider the function 

5(A,A):=i||(Z-X)®A|||. (20) 
+ ^ QTr (A^ A) - Tr(©^ A) + iTr(0^A)) 

where A := Ainax(R^^) is the maximum eigenvalue of R^^, 
and := AI — R^^. The following properties of g(A, A) 
imply that it majorizes /(A) at A, satisfying the technical 
conditions required for the convergence of BSUM (properties 
i)-iii) are established in the the proof of Lemma [T] in the 
Appendix). 

i) /(A)=5(A,A); 

ii) ^/(A)U=A = A5(A,A)|A=A;and, 

iii) /(A)<.g(A,A), VA. 

The computational advantage of minimizing .g(A, A) in 
place of /(A) comes from (?(A, A) being separable across 
rows of A. To see this, consider the Kathri-Rao product 
n := C B := [ci ® bi, . . . Cfl b^], defined by the 
column-wise Kronecker products ® b^. Let also matrix 
Z := [Zi, . . . , Zp] e nMxNP jjgjjQjg jjjg unfolding of Z along 
its tube dimension, and likewise for A := [Ai,...,Ap] e 
{0, lYi>'NP ^jj^j X := [Xi, . . . ,Xp] e ^^-f". Then, using 
the following identity ifTol 

X := [Xi,...,Xp] = An^. (21) 

it is possible to rewrite (l20t as 

5(A,A) :=i||(Z-An^) 

+ M (^Tr (A^ A) - Tr(0^ A) + iTr(0^A)) 
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Algorithm 1 : Low-rank tensor imputation (LRTI) 



function update_factor(A, R, n, A, Z, fi) 

Set A = An,ax(R"') 

Unfold A and Z over dimension of A into A and Z 

Set = (AI-R-^)A 

for m = 1, . . . , M do 

Select rows z^^, S"^^, and 0^-^, and set D„i = diag((5m) 
Compute a,„ = (n^D,„n + A^iI)"^(n^D„z,„+^i6»™) 
Update A with row 

end for 

return A 
end function 

Initialize A, B and C randomly. 

while I cost — cost_old| < e do 

A = update_factor(A,R.4, (C B), A, Z, fi) 
B = update_factor(B,Rs, (A C), A, Z, ^) 
C = update_factor(C,Rc, (B A), A, Z, ^) 
Recalculate cost in il3\ 

end while 

return X with shces Xp = Adiag(ep C)B^ 



which can be decomposed as 



M 



giA,A)=J2 

1=1 

(A| 



m—1 

2 



-||(5,„®z™ - diag(^m)na„j2 



(22) 



where z^, a^, Oj^, and a^, represent the m-th rows 
of matrices Z, A, A, 0, and A, respectively. Not only (l22l i 
evidences the separability of ( l20l i across rows of A, but it 
also presents each of its summands in a standardized quadratic 
form that can be readily minimized by equating its gradient 
to zero. Accordingly, the majorization strategy reduces the 
computational load to R systems of M equations that can be 
solved in parallel. Collecting the solution of such quadratic 
programs into the rows of a matrix A* yields the minimizer 
of (Onil, and the update A ^ A* for the BSUM cycle. Such 
a procedure is presented in Algorithm [1] where analogous 
updates for B and C are carried out cyclically. 

By virtue of properties i)-iii) in Lemma [T] convergence of 
Algorithm [T] follows readily from that of the BSUM algo- 
rithm 



Proposition 3: The iterates for A, B and C generated by 
Algorithm |7] converge to a stationary point of (113b . 

V. Inference for low-rank Poisson tensors 

Adoption of the LS criterion in (H) assumes in a Bayesian 
setting that the random Z is Gaussian distributed. This section 
deals with a Poisson-distributed tensor Z, a natural alternative 
to the Gaussian model when integer-valued data are obtained 
by counting independent events ITOl. Suppose that the entries 
Zrnnp of Z are Poisson distributed, with probability mass 
function 



kl 



(23) 



and means given by the corresponding entries in tensor X- For 
mutually-independent {zmnp}, the log-hkehhood /a^(Z;X) of 



X given data Z only on the entries specified by A, takes the 
foiTn 

M N P 

^_A_(Z , X) — ^ ^ ^ ^ ^ ^ ^mnpi^mnp lo§(*^mnp) -^mnp] 



m—1 n— 1 p— 1 



(24) 

after dropping terms log(zm„p!) that do not depend on X. 

The choice of the Poisson distribution in ( |23] l over a Gaus- 
sian one for counting data, prompts minimization of the K-L 
divergence ( |24] | instead of LS as a more suitable criterion llTOl . 
Still, the entries of X are not coupled in ( l24l i. and a binding 
PARAFAC modeling assumption is natural for feasibility of the 
tensor approximation task under missing data. Mimicking the 
method for Gaussian data, (nonnegative) Gaussian priors are 
assumed for the factors of the PARAFAC decomposition. Ac- 
cordingly, the MAP estimator of X given Poisson-distributed 
data (entries of Z indexed by A) becomes 

MNP 

Z := art 



{x,A,B.c}er 



m—1 71 — 1 p— 1 

+ I [Tr (A'^R;^! A)+Tr (B^R^^Bj+Tr (C^Rp^C)] (25) 

over the feasible set r:={X, A, B, C : A > 0, B > 0, C > 
0, Xp = Adiag [ejc] B^, p 1, . . . , P), where the symbol 
> should be understood to imply entry-wise nonegativity. 

With the aid of Representer's Theorem, it is also possible 
to interpret (IZST i as a variational estimator in RKHS, with K-L 
analogues to (fT4li-(fT6]l, so that the conclusions thereby regard- 
ing smoothing, prediction and prior covariance estimation carry 
over to the low-rank Poisson imputation method (l25l l. 

A. Block successive upper-bound minimization algorithm 

A K-L counterpart of the LRTI algorithm is developed in this 
section, that provably converges to a stationary point of 
via an alternating-minimization iteration which optimizes 
sequentially w.rt. one factor matrix, while holding the others 
fixed. 

In the sequel, the goal is to arrive at a suitable expression 
for the cost in dZSl l. when viewed only as a function of e.g., 
A. To this end, let matrix Z := [Zi,...,Zp] e n^^i^-NP 
denote the unfolding of Z along its tube dimension, and 
likewise for A := [Ai,...,Ap] G {0,1}*^^^-^ and X := 
[Xi, . . . , Xp] e R*^^^^. Based on these definitions, ^ can 
be written as 



Za(Z;X) = li'^(A®[X-Z®log(X)])l 



NP 



(26) 



where 1m, Inp are all-one vectors of dimensions M and NP 
respectively, and log( ) should be understood entry-wise. The 
log-likelihood in ( |26] | can be expressed in terms of A, and the 
Kathri-Rao product 11 := B C by resorting again to (l2Tl i. 
Substituting i2T[ into ( |26] ) one arrives at the desired expression 
for the cost in dZSl l as a function of A, namely 

/(A) := ll',(A®[An - Z®log(An^)])ljvp 
+ ^Tr(A-R-A). 

A closed-form minimizer A* for /(A) is not available, but 
since /(A) is convex one could in principle resort to an 
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Algorithm 2 : Low-rank Poisson-tensor imputation (LRPTI) 
function update_factor(A, R, n, A, Z, ^) 

Set A = An,ax(R"') 

Unfold A and Z over dimension of A into A and Z 
Compute S = ^® ( An'^ ^) (element- wise division) 
Compute T = (/i(AI - R'^)A - An) 

Update A with entries a^r = tmr + V^'mr + Smr 
return A 
end function 

Initialize A, B and C randomly, 
while I cost — cost_oldj < e do 

A = update_factor(A,Ra, (C B), A, Z, ^) 

B = UPDATE_FACTOR(B,Rfl, (A C), A, Z, ^) 
C = UPDATE_FACTOR(C,Rc, (B A), A, Z, /i) 
Recalculate cost in ( 125b 
end while 



return X with slices Xr 



AdiagI 



e^C)B^ 




iterative procedure to obtain A*. To avoid extra inner iterations, 
the approach here reUes again on the BSUM algorithm ||201 . 
For A given, consider the separable function 

M.R 2 

.g(A, A) := /.jA ^ ^ '^Unramr - Smr log(a„ir) + U„ir^ 

m,r— 1 

(27) 

where A := Amax(R-^^) is the largest eigenvalue of R^^, and 
the parameters s, m, irm, and u„n are defined in terms of A, 
Z, A, n, and := (AI - R^^) A by 

J- ^rnk ^mk^rnr '^kr 

k—l / 'r' —1 "'^^ fey 
^ NP 

^ k=l 

and Umr • — {^mr^nir ^^/j;— i <^mfc^mfc^mr^fcr^mrA:^ i 

with Vrnrk ^Og{a,nrT^kr / J2r' = 1 ^rnr'T^kr')/ J2f' = l ^mr'T^kr' ■ 

As asserted in the following lemma, g(A, A) majorizes /(A) 
at A and satisfies the technical conditions required for the 
convergence of BSUM (see the Appendix for a proof.) 

Lemma 1: Function g{A., A) satisfies the following properties 

i) /(A) =5(A,A); 

ii) ^/(A)U=A = ^3(A,A)|A^A;«««f. 

iii) /(A)<.g(A,A), VA. 

Moreover, g{A, A) is minimized at A ^ A* with entries 

Lemma [T] highlights the reason behind adopting g{A, A) in the 
proposed block-coordinate descent algorithm: it is separable 
across the entries of its matrix argument [cf. (l27i l. and hence 
it admits a closed-form minimizer given by the MR scalars 
at The updates A <— A* are tabulated under Algorithm 
|2] for solving dZSl l. where analogous updates for B and C are 
carried out cyclically. 

By virtue of properties i)-iii) in Lemma [T] convergence 
of Algorithm |2] follows readily from the general convergence 
theory available for the BSUM algorithm ||20l . 

Proposition 4: The iterates for A, B and C generated by 
Algorithm |2] converge to a stationary point of ( 125b . 



Fig. 3. Performance of the low-rank tensor imputation method as function 
of the regularizing parameter fi; (top) rank of the tensor as recovered by (|8] 
averaged over 100 test repetitions, (bottom) relative recovery error 

A related algorithm, abbreviated as CP-APR can be found 
in ifTOl , where the objective is to find the tensor's low-rank 
factors per se. The LRPTI algorithm here generalizes CP-APR 
by focusing on recovering missing data, and incorporating 
prior information through rank regularization. In terms of 
convergence to a stationary point, the added regularization 
allows for lifting the assumption on the linear independence of 
the rows of 11, as required by CP-APR iflOl - an assumption 
without a straightforward validation since iterates 11 are not 
accessible beforehand. 

VI. Numerical Tests 
A. Simulated Gaussian data 

Synthetic tensor-data of dimensions M x N x P = 16 x 
4x4 were generated according to the Bayesian tensor model 
described in Section |IV] Specifically, entries of Z consist of 
realizations of Gaussian random variables generated according 
to ( fTTT i. with means specified by entries of X and variance 
scaled to yield an SNR of — 20dB . Tensor X is constructed 
from factors A, B and C, as in Matrices A, B, and C 
have i? = 6 columns containing realizations of independent 
zero-mean, unit-variance, Gaussian random variables. 

A quarter of the entries of Z were removed at random and 
reserved to evaluate performance. The remaining seventy five 
percent of the data were used to recover Z considering the 
removed data as missing entries. Method (jS) was employed 
for recovery, as implemented by the LRTI Algorithm, with 
regularization ^(|| A|||, + j|B|l|, + ||C|||,) resulting from setting 
R.A = 1m, Rb = In, and Rc ~ Ip- 

The relative recovery error between Z and data Z was 
computed, along with the rank of the recovered tensor, as a 
measure of performance. Fig. |3] depicts these figures of merit 
averaged over 100 repetitions of the experiment, across values 
of /.( varying on the interval 10~^/imax to /imax, which is 
computed as in Corollary [T] Fig |3] (bottom) shows that the 
LRTI algorithm is successful in recovering the missing entries 
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Fig. 4. Performance of the low-rank Poisson imputation method as function 
of the regularizing parameter fi; (top) rank of the recovered tensor averaged 
over 100 test repetitions, (bottom) relative recovery error. 

of Z up to — lOdB for a wide range of values of /i, presenting 
a minimum at = This result is consistent with 

Fig. [3] (top), which shows that rank i?* = 6 is approximately 
recovered at the minimum error Fig. |3](top) also corroborates 
the low-rank inducing effect of (O, with the recovered rank 
varying from the upper bound R = NP = 16 to i? = 0, as /i 
is increased, and confirms that the recovered tensor is null at 
Z^max as asserted by Corollary [T] 

B. Simulated Poisson data 

The synthetic example just described was repeated for the 
low-rank Poisson-tensor model described in Section |V] Specif- 
ically, tensor data of dimensions M x N x P = 16x4x4 
were generated according to the low-rank Poisson-tensor model 
of Section |Vl Entries of Z consist of realizations of Poisson 
random variables generated according to (l23l . with means 
specified by entries of X- Tensor X is again constructed as 
in ^ from factors A, B and C having i? = 6 columns, 
containing the absolute value of realizations of independent 
Gaussian random variables scaled to yield E[a;mnp] = 100. 
Half of the entries of Z were considered missing to be 
recovered from the remaining half. Method (l25t was employed 
for recovery, as implemented by the LRPTI Algorithm, with 
regularization f (||A1||, + ||B|||, + ||C|||,). 

Fig. |4] shows the estimated rank and recovery error over 100 
realizations of the experiment, for /i in the interval 0.01 to 
100. The recovery error in Fig. 2](bottom) exhibits a minimum 
of — 15dB at /J, = 1, where the rank i?* = 6 is recovered 
[cf. Fig. |4] (top).] The low-rank inducing effect of (O is again 
corroborated by the decreasing trend in Fig. E](top), but in this 
case the rank is lower bounded by i? = 1, because the K-L 
fitting criterion prevents (|25t from yielding a nuU estimate Z. 

C. MRI data 

Estimator ( fT4] i was tested against a corrupted version of the 
MRI brain data set 657 from the Internet brain segmentation 



repository lITSl . The tensor Z to be estimated corresponds to 
a three-dimensional MRI scan of the brain comprising a set 
of P = 18 images, each of Af x iV = 256 x 196 pixels. 
Fifty percent of the data is removed uniformly at random 
together with the whole slice Z„, n = 50. Fig. |5] depicts the 
results of applying estimator (fT4t to the remaining data, which 
yields a reconstruction error of — 10.54dB. The original slice 
Zp, p = 5, its corrupted counterpart, and the resulting estimate 
are shown on top and center left. Co variance matrices K^vf, 
K and Kp are estimated from six additional tensor samples 
containing complementary scans of the brain also available at 
ifTSl . Fig.|5](center right) represents the covariance matrix K^v 
for column slices perpendicular to Zp, showing a structure 
that reflects symmetries of the brain. This correlation is the 
key enabler for the method to recover the missing slice up 
to — 9.60dB (see Fig. |5] (bottom)) by interpolating its a priori 
similar parallel counterparts. 

All in all, the experiment evidences the merits of low-rank 
PARAFAC decomposition for modeling a tensor, the ability of 
the Bayesian estimator (fTSl l in recovering missing data, and the 
usefulness of incorporating correlations as side information. 

On account of the comprehensive analysis of three-way MRI 
data arrays in fSl, and the nonnegative PARAFAC decompo- 
sition advanced thereby, inference of tensors with nonnegative 
continuous entries will be pursued as future research, combin- 
ing methods and algorithms in sections |IV] and |V] of this paper 



D. RNA sequencing data 

The RNA-Seq method described in fTS) exhaustively counts 
the number of RNA transcripts from yeast cells. The reverse 
transcription of RNA molecules into cDNA is achieved by 
P = 2 alternative methods, differentiated by the use of oligo- 
dT, or random-hexonucleotide primers. These cDNA molecules 
are sequenced to obtain counts of RNA molecules across 
AI = 6, 604 genes on the yeast genome. The experiment was 
repeated in ifTSl for a biological and a technological replicate 
of the original sample totalling iV = 3 instances per primer 
selection. The data are thus organized in a tensor of dimensions 
6, 604 X 3 X 2 as shown in Fig. |6] (top), with integer data that 
are modeled as Poisson counts. Fifteen percent of the data is 
removed and reserved for assessing performance. The missing 
data are represented in white in Fig. |6] (center). 

The LRPTI algorithm is run with the data available in Fig. 
|6] (center) producing the recovered tensor depicted in Fig. 
(bottom). The recovery error for this experiment was — 15dB. 

VII. Concluding summary 

It was shown in this paper that regularizing with the 
Frobenius-norm square of the PARAFAC decomposition fac- 
tors, controls the tensor's rank by inducing sparsity in the 
vector of amplitudes of its rank-one components. A Bayesian 
method for tensor completion was developed based on this 
property, introducing priors on the tensor factors. It was ar- 
gued, and corroborated numerically, that this prior information 
endows the completion method with extra capabilities in terms 
of smoothing and extrapolation. It was also suggested through 
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Fig. 5. Results of applying )14t to the MRI brain data set 657. (top) original 
and recovered fibers Zp and Zp for p = 5. (center) input fiber Zp, p = 5 
with missing data, and covariance matrix Kyv/. (bottom) original and recovered 
columns Z„ and Z„ for the position n = 50 in which the whole input slice 
is missing ) 



a parallelism between Bayesian and RKHS inference, that 
the prior covariance matrices can be obtained from (sample) 
correlations among the tensor's slices. In such a probabilis- 
tic context, generic distribution models for the data lead to 
multiple fitting criteria. Gaussian and Poisson processes were 
especially considered by developing algorithms that minimize 
the regularized LS and K-L divergence, respectively. 

Numerical tests on synthetic data corroborated the low-rank 
inducing property, and the ability of the completion method to 
recover the "ground-truth" rank, while experiments with brain 
images and gene expression levels in yeast served to evaluate 
the method's performance on real datasets. 

Although the results and algorithms in this paper were 
presented for three-way arrays, they are readily extendible to 
higher-order tensors or reducible to the matrix case. 



Appendix 

I. Proof of Proposition [T] 

Proof: a) The equivalence of (|2]i and (|4|i results immedi- 






Fig. 6. Imputation of sequencing count data via LRPTI; (top) original data; 
(center) data with missing entries; (bottom recovered tensor. 



ately from ((Sj. Indeed, if ^ is minimized in two steps 



mill — min 11 fZ — X) 



fdici 



l|B||^) 



s. to CB' 



X 



(28) 



it is apparent that the LS part of the cost does not depend on 
the inner minimization variables. Hence, (l2Ft can be rewritten 

as 

1 . „o . A*. 



mm -j|(Z - X)®Al|i 
X 2 



mm 

B,C 2 
s. to CB^=X 



|C||^ 



|B|||) 



(29) 



and by recognizing (O as the inner problem in (|29^ . the 
equivalence follows. 

b) Consider the cost in ^ at the local minimum (B, C) 



C/(B,C) 



^ll(Z-X)(: 



f(l|C||^ 



l|B||^) 



where X := BC"^. Arg uing by contradiction, suppose that 
there is a different local minimum (B, C) such that ?7(B, C) ^ 
U{B,C), and without loss of generality set U{B,C) < 
C/(B,C), so that dU := U{B,C) - U{B,C) < 0, which 
can be expanded to 

dU = Tr [(A®(Z - X)) (A®(X - X))] + |1 A®(X - X)ll|, 



(l|C| 



icill^ 



|B||^ 



l|B|||)<0. 



(30) 



Setting this inequality aside for now, consider the augmented 
matrix Q in terms of generic B and C matrices: 



Q 



B 

C 



B^ 



BB^ 

X^ 



X 

cc^ 



(31) 



and the corresponding Q defined in terms of B and C. 
For each value of 6* G (0, 1) consider the convex combination 



Qe :=Q + 6i(Q Q). 



(32) 



As both Q and Q are positive semi-definite, so is Qe and by 
means of the Choleski factorization one obtains 



Qe 



Be 
Co 



BgBg 

X' 



(33) 
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which defines Bg, Cg and Xg. 

Expanding the cost difference dUe as in ( l30l ) resuhs in 

dUe :=C/(Be,Ce)-C/(B,C) 

= Tr [(A®(Z - X)) (A®(X - Xg))] 



icill^ 



|B||2- 



\A®iX-Xg)\\%. 



From the definitions (EB-dSU it follows that X - Xe = 

0(X-X), ||B,|||,-||B!|2, = 0(||B|||,-||B|||,), and ||C,|||,- 
I|C1||. = (?(1|C||2,-||C|||,), so that 

dUg := OTi- [(A®(Z - X)) (A®(X - X))] 

+ ^(l|Ci||-||Cip^ + ||B!p^-||B||^) 

+ 02||A®(X-Xe)||| 

and thus, it can be put in terms of ( |30] | as in 

dUg := e {dU - II A®(X - Xe)||2,) + 9^A®{X - Xg)\\l. 

If dU were strictly negative, so would dU — ||A®(X — 



Xe)|||,, and hence 

lim ^dUg - {dU - II A® (X - Xg)\\l) < 0. 

but then there is in every neighborhood of (B, C) a point 
(Be, Cg) such that U{Bg, Cg) < C/(B, C), B, C cannot be a 
local minimum. This contradiction implies that ?7(B, C) = 
J7(B, C) for any pair of local minima, which proves the 
statement in part b) of Proposition [T] ■ 
II-Equivalence of tensor completion problems 

Proof: The Frobenius square-norms of A, B, and C are 
separable across columns; hence, the penalty in (|8) can be 
rewritten as 



|A||?- + ||B||^- 



by defining := ||a,.||, b,. := ||b,.|| , := ||c,.||, r = 

On the other hand, X can be expressed w.l.o.g. in terms of 
the normalized outer products (|6) with jr ■= UrbrCr- Substi- 
tuting (|6]l and ( l34b for the tensor and the penalty respectively, 
^ reduces to 

min mill min — 1| (Z — X) ®A||p 

{u},{v},{w} 7 {a^},{br}dcr}2"^- —' — 



R 

M \ ' 2 



bl + d 



R 



S. to X = 7r(Ur O V,. O Wr) 



T = \ 



(35) 



Focusing on the inner minimization w.rt. norms a^, br, 
and Cr for arbitrary fixed directions {u^}, {v^}, and {wj.}, 
and fixed products 7^ := UrbrCr- The constraints and hence 



the LS part of the cost depend on 7^ only, and not on their 
particular factorizations arb,.Cr. Thus, the penalty is the only 
term that varies when 7^ is constant, rendering the inner-most 
minimization in ([35]) equivalent to 



mm 



bl 



(36) 



The arithmetic geometric-mean inequality gives the solution 
to ( l36l ), as it states that for scalars a^, b^ and 6^, it holds that 



^/a^.<{l/i){al + bl + cl) 



with equality when 
is attained at ai =b1 - 



bl 
^2 _ 



2/3 

It ■ 



Substituting the corresponding X]r=i(^r + + c^) = 
"iY^f^i^V^ ~ 3II7II2/3 * ^35T l yields Equivalence of 
the optimization problems is transitive; hence, by showing that 
both (|9]l and ^ equivalent to ( [35] ) proves them equivalent to 
each other, as desired. ■ 

III. Proof of Corollary [l] 

Proof: The following result on the norm of the matrix 
inverse will be used in the proof of the corollary. 

Lemma 2: 413 /7.5S7 //E e i?"^™ satisfies ||E||_f < 1, then 
I + E ii invertible, and ||(I + E)"i||^ < (1 - I|E||f)""^- 

Another useful inequality holds for any value of /i, and for 
A, B, and C being the minimizers of dH) 



B 



11-) 



< ||A®Z 



(37) 



as it follows from comparing the cost at such a minimum, and 
at the feasible point (A, B, C) = (0, 0, 0). 

A second characterization of the minimum of ((8) will be 
obtained by equating the gradient to zero. By vectorizing 
matrix A, the cost in ([8]) can be rewritten as 



p=i 



|C||^ = 5]||a,,||^ + ||b,,|r + ||c,,||^ 

r=l 
R 

= "^al + bl + cl (34) where z. 



- ||diag[^p] (zp - (Bdiag[eJC] ® I))a) 



|2 , -"11 I 



(38) 



8p, and a denote the vector rearrangements of ma- 
trices Zp, Dp , and A, respectively. Additional regularization 
that vanishes when taking derivatives w.r.t. A were removed 
from ( [38] l. Setting the gradient of ((38) w.rt. a to zero, yields 



a= (I + E)-iC 



with 



1 P 

E := - ^ (B^diag[eJC] ® I) diag[^p] (Bdiag[eJC] ® I) 

C := - ^ (B^diag[eJC] ® I) diag[^p]zp. 

The norms of E and C, can be bounded by using the sub- 
multiplicative property of the norm, and the Cauchy-Schwarz 
inequality, which results in 

||E||F<i||B|||||C||?. 

IICI|f< -||A®Z||j.||B||;.||C||j.. 
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Then according to the previous lemma, if /i is chosen large 
enough so that ||E||f < 1 then the norm of A is bounded by 



A||f = ||a||2 < in 



|B|||.||C|||)-i||Bi 



i.i|C||F||A®Z||F 
(39) 



which constitutes the sought second characterization of the 
minimum of ([8]). 

Yet a third characterization was obtained during the proof of 
Proposition |2] in which the norm of the factor columns were 
shown equal to each other, so that 



IAIIf = llBilp = lie 



(40) 



(41) 
(42) 



Substituting gOll into (O and ^ yields 

||A|||, < ||A®Z|||/3Ai 
||A||^<(M-i|A||4)-i||A|l||lA®Z|l;^. 

Form (|42] |. two cases are found possible; 



case 1: ||Aj|F = 0; and 

case 2: 1 < (1 - \\A\\%/ fi)-^\\A\\F\\A®Z\\F/ f^- (43) 

To argue that the second case is impossible, substitute (ITTT i 
into ( |43] | and square the result to obtain 



1<(1-| 



A®Z|||,/V) 



-2j|A®Z|||,/3M^ 



(44) 



But by hypothesis fi > \\A®Zl\\f^ '^^at 1| A®Z||^//^^ < 1, 
and the right-hand side of (l44l i is bounded by 0.43, so that the 
inequality does not hold. This implies that the first case in ( l43] i: 
i.e., ||A||i? = 0, must hold, which in accordance with ( l40l i. 
further implies a null solution of (|8). That was the object of this 
proof. Still, the bound at 0.43 can be pushed to one by further 
reducing ^, and the proof remains valid under the slightly 
relaxed condition fi > (18/(5 + \/2T))-i/3|| A®Z||;^'^' ~ 
0.81||A®Z||^/^ ■ 
IV-RKHS imputation 

Recursive appUcation of Representer's Theorem yields finite 
dimensional representations for the minimizers a^, br, and Cr 
of (fT4l i. given by 

ar{m) = X]f,!'=i"™'^Ai("i',m) 



br{n) = Y.n'=iPrn'ku{n' -.n) 



Defining vectors k'^{m) := [fc^ (1, m), . . . , fc>i (Af, m)], and 
correspondingly kjj-{n) := [kj\f{l, n), . . . , kj\f{N, n)], and 
kp{p) [k-p{l,p), . . . , k-p{P,p)], along with matrices A G 
]gMxfl . ^(to^^) (^^^^^ B G R^^-R : 5(71, r) := /3„^, and 
C G K-f"^-" : C{p, r) 7p^, it follows that 



fR{m,n,p) = y^^ar{m)br{n)cr{p) 



feX,(m)Adiag k^ip)C B^fc^(n). (45) 



Matrices A, B, and C are further obtained by solving 

p 



min II (Zp - KA4Adiag [ejKpC] B^K^r) ® A 



A.C.B 



pWf 



+ 1 (trace(ATKA4A) +tracc(B'^KA/-B) + trace(C^KpC)) 



s. to A G 



r,MxR 



, B G 



nNxR 



C G 



pPxfl 



which is transformed into (fT6] l by changing variables A = 
K>f A, B = Ka/'B, and C = K-pC, just as (|45] | becomes 

(Us). 

V-Covariance estimation 

Inspection of the entries of Kp(p,p') := E [Tr(XjXp/)] 
under the PARAFAC model, yields 



Kr{p,p') 



E 



Tr I ^ b,,Cr(p)a^ ^ a^-.c^- (p')b,.' 

\r— 1 r' — l t 

R 

^ ^ E (c^(p)c,Kp')) E (b^,b,) E (a^a,-) 

r— 1 r' — l 
R 

5]E(c,(pK(p'))E|lb,|l2E|la,f 



R 



= ^Rc(p,p')Tr(Rij)Tr(R^) 

= i? Rc(p,p')Tr(RB)Tr(RA) 

which, after summing over p' = p, yields 

p p 
E|lX|l|, = ^E||X,|||, = 5^Rp(p,p) 

p— 1 p— 1 

= i?Tr(Rc)Tr(RB)Tr(R^). 



(46) 



In addition, by incorporating the equal power assumption 
( fTSl i. equation ( |46] l further simplifies to 



E||X|1 



as stated in ( fTsT l. 
VI - Vector form of (O 

The vec operator can be combined with the Kronecker 
product to factorize vec(AQ-'") = (Q I) vec (A), and with 
the Hadamard product to convert it to a standard matrix 
product vec(A®A) = diag(vec(A))vec(A). Using these two 
properties, ([19} can be put in terms of a := vec(A) as in 



1 P 

/(a) - ^ ||diag(vec(Ap)) (vec(Zp) - Bdiag(e^C)a) 



^a(I(g)R^^).a 



(47) 



VII - Proof of Lemma [T] 

Proof: Function g(A, A) in (l27T i is formed from /(A) af- 
ter substituting ,gi(A, A) for /i(A), and 52(A, A) for /2(A), 
respectively, as defined by 

/i(A) :=Tr(A^R^iA) (48) 
31 (A, A) := ATr ( A'^A) - 2Tr(©^ A) + Tr(©^ A) (49) 
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II 



where A := Aniax(R-^^) and © := AI — R^'^, and 



/2(A) := -lMA®Zlog(An^ )1np 

M NP 

52 (A, A) := - ^ SmkZmkanikr log 

m— 1 A:— 1 



^mkr 



(50) 



(51) 



with Umkr amr'T^kr' / J2r' = 1 amr'T^kr'- 

Hence, properties i)-iii) will be satisfied by the pair of 
functions g(A,A) and /(A) in Lemma [T] as long as they 
are satisfied both by the pair in (|48]l-(|49]l and that in (|50]|-(|5T]). 

Focusing on the first pair, both functions are separable per 
column of A and A, and their difference takes the form 



.91 (A, A)-/i(A)=^[Aa^a,. - 20ja,. + 0ja^ - a^R^^^a^] 

r=l 
R 

= ^(a^ - ar)^(AI - R^^)(ar - a^) 



which is positive and, together with its gradient, vanish at A. 
This establishes that properties i)-iii) are satisfied by .gi(A, A) 
and /i(A)), and thus they are so for functions g{A,A) and 
/(A)) in @ and (O. 

Considering the second pair, and expanding /2(A) yields 



M NP 



R. 



/2(A) = - ^ ^(^mfe^mfclog ^ amr'T^kr' (52) 
m=l \r' = l ) 

where the logarithm can be rewritten as (see also llTOl ) 

log ^ amr'TTfcr' = log ^ Q^mfcr' (53) 



> ^ amfcr log ^ 



(^mr' '^kr' 
^mkr 



(54) 



and the inequality holds because of the concavity of the 
logarithm with an argument being a convex combination with 
coefficients {amkr}^=i summing up to one. 

Since substituting for (|53ll in results in (jSD, it 
follows that 92 (A, A) and /2(A) satisfy property iii). The 
proof is complete after evaluating at the pair of functions and 
their derivatives at A to confirm that properties i) and ii) hold 
too. 

The minimum a* „^ ;= tmr + ^/tmr + ^mr is obtained read- 
ily after equating to zero the derivative of the corresponding 
summand in (|22] |. and selecting the nonnegative root. ■ 
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